Wildlife recognition in nature documentaries with weak supervision from subtitles and external data
نویسندگان
چکیده
We propose a weakly supervised framework for domain adaptation in a multi-modal context for multi-label classification. This framework is applied to annotate objects such as animals in a target video with subtitles, in the absence of visual demarcators. We start from classifiers trained on external data (the source, in our setting ImageNet), and iteratively adapt them to the target dataset using textual cues from the subtitles. Experiments on a challenging dataset of wildlife documentaries validate the framework, with a final F1 measure of approximately 70%, which significantly improves over the results of a state-of-the-art approach, that is, applying classifiers trained on ImageNet without adaptation. The methods proposed here take us a step closer to object recognition in the wild and automatic video indexing.
منابع مشابه
Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision
We investigate animal recognition models learned from wildlife video documentaries by using the weak supervision of the textual subtitles. This is a challenging setting, since i) the animals occur in their natural habitat and are often largely occluded and ii) subtitles are to a great degree complementary to the visual content, providing a very weak supervisory signal. This is in contrast to mo...
متن کاملSummarization of films and documentaries based on subtitles and scripts
We assess the performance of generic text summarization algorithms applied to films and documentaries, using the well–known behavior of summarization of news articles as reference. We use three datasets: (i) news articles, (ii) film scripts and subtitles, and (iii) documentary subtitles. Standard ROUGE metrics are used for comparing generated summaries against news abstracts, plot summaries, an...
متن کاملAdvancing human pose and gesture recognition
This thesis presents new methods in two closely related areas of computer vision: human pose estimation, and gesture recognition in videos. In human pose estimation, we show that random forests can be used to estimate human pose in monocular videos. To this end, we propose a co-segmentation algorithm for segmenting humans out of videos, and an evaluator that predicts whether the estimated poses...
متن کاملEmploying signed TV broadcasts for automated learning of British Sign Language
We present several contributions towards automatic recognition of BSL signs from continuous signing video sequences: (i) automatic detection and tracking of the hands using a generative model of the image; (ii) automatic learning of signs from TV broadcasts of single signers, using only the supervisory information available from subtitles; (iii) discriminative signer-independent sign recognitio...
متن کاملStelios Piperidis 1,2 Infrastructure for a Multilingual Subtitle Generation System
The expansion of digital television and the increasing demand to manipulate audiovisual content underlie the need for tools and systems that will automate the multilingual subtitle generation process. In this setting the MUSA project aims at providing a system which combines speech recognition, advanced text analysis, and machine translation to help generate multilingual subtitles. In its curre...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Pattern Recognition Letters
دوره 81 شماره
صفحات -
تاریخ انتشار 2016